Abstract

We present memory-efficient deterministic algorithms for constructing ε-nets and ε-approximations
of streams of geometric data. Unlike probabilistic approaches, these deterministic samples
provide guaranteed bounds on their approximation factors. We show how our deterministic samples
can be used to answer approximate online iceberg geometric queries on data streams. We use these
techniques to approximate several robust statistics of geometric data streams, including Tukey depth,
simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our
algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors
are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries.

1 Introduction

With the proliferation of streams of packets on the Internet, as well as data streaming from embedded
systems, digital monitors, sensor networks, and scientific instruments, there is a need for new algorithms
that can compute approximations or answer approximate queries on data streams. The main challenge
in these contexts is that the data volumes are often much larger than the memory size of a typical
computer. Thus, there is considerable interest in methods that can process data streams
using limited memory (e.g., see the recent surveys by Muthukrishnan [27] and Babcock et al. [2]). The model we
choose to work in is the so-called Time Series model, in which each time instant reveals a new element
of the data stream "signal."

A typical approach in data streaming algorithms is to maintain a random sample of the input data
and perform computations on the sample with the hope that information about the sample can be used
to infer properties of the entire set. Naturally, such inferences come with an associated probability that
they are inaccurate. In this paper, we are interested in deterministically constructing samples of a data
stream that have guaranteed approximation properties for the original set. Moreover, because of the
limited memory restriction of data streaming applications, we are interested in deterministic samples
that can be constructed using space that is polylogarithmic in the data stream's length.
In addition, because much of the streaming data comes from sensors and scientific instruments,
we are interested in this paper in studying streaming algorithms for geometric data. Such data could
include multi-dimensional points in the color space of astrophysical data or two-dimensional lines
defined by a point-line duality of a stream of points in the plane. Of particular interest, then, are
data streaming algorithms for constructing ε-nets and ε-approximations, which are general structures
developed in the computational geometry literature for deterministically sampling geometric data.
Indeed, ε-nets and ε-approximations are developed in a very general context of bounded-dimensional
range spaces, where we are given a ground set and a polynomial-sized family of ranges on that set (which
constitute the queries or sampling statistics we are interested in). Hence, results for constructing such
deterministic samples should have a considerable number of applications.

1.1 Related work on Streaming Algorithms

Data streaming problems have engendered a large amount of interest among the algorithms community
over the last few years. For a comprehensive survey of the work done so far and some interesting
directions for the future, the reader is referred to Muthukrishnan's work [27]. An earlier survey by
Babcock et al. [2] explores the issues arising in building data stream systems.

We are not familiar with any previous work on constructing ε-nets or ε-approximations in streaming
models, although these structures have been extensively studied in full-memory contexts (e.g., see the
chapter by Matoušek). The closest previous work is done in the iceberg query [9] framework of
Manku and Motwani [22], who provide $1 + \varepsilon$ approximations for the frequency counts of items in a data
stream that occur more than $\varepsilon N$ times (which are the so-called "icebergs"). Similarly, algorithms for
computing the quantiles of a data stream have been given by Greenwald and Khanna [12], guaranteeing
a precision of $\varepsilon N$, which is similar to the guarantees that are provided by ε-approximations, while using
$O(\frac{1}{\varepsilon} \log \varepsilon N)$ space. This limitation of an additive $\varepsilon N$ error in every quantile is overcome by Gupta and
Zane [13]. The latter's method provides relative error for all quantiles but uses $O(\log^2 N/\varepsilon^3)$ space and
requires knowledge of an upper bound on the stream size.

The first geometric problem to be studied in the streaming model was that of finding the diameter of
a set of points. Feigenbaum, Kannan and Zhang [10] gave an $O(1/\varepsilon)$ space algorithm for computing the
diameter of points in two dimensions in the streaming model and an
$O(\frac{1}{\varepsilon^{3/2}} \log^3 N (\log R + \log\log N + \log\frac{1}{\varepsilon}))$
space algorithm for computing it in the sliding window model, where $R$ is the maximum,
over all windows, of the ratio of the diameter to the distance between the closest two points in the
window. Indyk [16] gave a streaming algorithm which maintains a $c$-approximate diameter of points
in $d$ dimensions, taking $O(dn^{1/(c^2-1)})$ time per new point, for $c > \sqrt{2}$.

Cormode and Muthukrishnan generalized the exponential histograms used on single-dimensional
data sets in earlier works on streaming algorithms and defined radial histograms [6], which
allowed them to give a $(1 + \varepsilon)$ approximation to the diameter using $O(1/\varepsilon)$ space. They were also
able to use these structures to approximate convex hulls in the sense that no point in the input stream
is more than $\varepsilon D$ outside the approximate hull, where $D$ is the diameter of the point set. Constructing an
approximate hull takes them $O(1/\varepsilon)$ space. Hershberger and Suri [15] improve this to give a sampling-based
algorithm for approximating the convex hull of a streaming point set, showing how to maintain
an adaptive sample of at most $2r$ points such that the distance between the hull of their sample and
the true convex hull is $O(D/r^2)$, where $D$ is the current diameter of the sample. Some of the other
geometric problems that have been studied in a streaming model include minimum spanning tree and
minimum weight matching [17], and certain facility location and nearest-neighbor queries.

1.2 Our Results

In this paper, we present memory-efficient deterministic algorithms for constructing ε-nets and
ε-approximations of streams of geometric data. Our algorithms use a polylogarithmic amount of memory,
provided ε is at least inverse-polylogarithmic. As mentioned above, ε-nets and ε-approximations are of
interest in their own right and have many applications in computational geometry. We show how our
deterministic samples can be used to answer online iceberg geometric queries on data streams, such as
in multi-dimensional iceberg range searching. Because the information typically of interest from data
streams is statistical, we focus in this paper primarily on the use of ε-nets and ε-approximations to
compute approximations to several robust statistics of geometric data streams, including Tukey depth,
simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Thus,
we additionally give polylogarithmic-space data streaming algorithms for computing approximations
to these statistics. We also include a lower bound for non-iceberg range queries in data streams.

2 Preliminaries on ε-Nets and ε-Approximations

In this section we recap certain aspects of ε-nets and ε-approximations [34, 26], which are part of
a general framework for modeling a number of interesting problems in computational geometry and
derandomizing divide-and-conquer type algorithms.

A range space is a set system, i.e., a pair $\Sigma = (X, \mathcal{R})$, where $X$ is a set and $\mathcal{R}$ is a set of subsets
of $X$. We call the elements of $\mathcal{R}$ the ranges of $\Sigma$, as $\mathcal{R}$ is typically defined in terms of some
well-structured geometry. If $Y$ is a subset of $X$, we denote by $\mathcal{R}|_Y$ the set system induced by $\mathcal{R}$ on $Y$, i.e.,

$$\mathcal{R}|_Y = \{R \cap Y : R \in \mathcal{R}\}.$$

We say a subset $Y \subseteq X$ is shattered if every possible subset of $Y$ is induced by $\mathcal{R}$, i.e., if $\mathcal{R}|_Y = 2^Y$.
The VC-dimension of $\Sigma$ is the maximum size of a shattered subset of $X$. If there are shattered subsets of
any size, then the VC-dimension is infinite. A related and simpler notion is the scaffold dimension
of $\Sigma$. It is based on the notion of the shatter function $\pi_{\mathcal{R}}(m)$, which we define as the maximum
possible number of sets in a subsystem of $\Sigma$ induced by an $m$-sized subset of $X$. In other words, it
is $\sup\{|\mathcal{R}|_Y| : Y \subseteq X, |Y| = m\}$. We now define the scaffold dimension of $(X, \mathcal{R})$ as the infimum
of all numbers $d$ such that $\pi_{\mathcal{R}}(m)$ is $O(m^d)$. It turns out that the shatter function of a set system of
VC-dimension $d'$ is bounded by $\binom{m}{0} + \binom{m}{1} + \cdots + \binom{m}{d'} = \Theta(m^{d'})$. Thus the scaffold dimension is
always at most the VC-dimension. Conversely, if the scaffold dimension is bounded by a constant, the
VC-dimension too is bounded by a constant. There are, however, many natural geometric set systems
of scaffold dimension strictly smaller than the VC-dimension; for instance, the scaffold dimension of
a set system defined by halfplanes in the plane is 2, while the VC-dimension is 3. In the rest of the
paper, we will always refer to the scaffold dimension of a set system. In addition, we consider only
those set systems whose scaffold dimensions are bounded by a constant.
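As a small illustration of these definitions, the following sketch computes induced systems, the shatter function, and the VC-dimension by brute force. The range family (closed intervals on a line) and all function names are assumptions chosen for the example, not taken from the paper.

```python
from itertools import combinations

def interval_ranges(ground):
    """All subsets of `ground` cut out by closed intervals, plus the empty set."""
    xs = sorted(ground)
    ranges = {frozenset()}
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            ranges.add(frozenset(xs[i:j + 1]))
    return ranges

def induced(ranges, Y):
    """The system induced on Y: every distinct intersection of a range with Y, listed once."""
    Y = frozenset(Y)
    return {R & Y for R in ranges}

def shatter_function(ground, ranges, m):
    """Maximum number of induced sets over all m-element subsets of the ground set."""
    return max(len(induced(ranges, Y)) for Y in combinations(ground, m))

def vc_dimension(ground, ranges):
    """Largest m for which some m-element subset is shattered (has 2**m induced sets)."""
    return max((m for m in range(1, len(ground) + 1)
                if shatter_function(ground, ranges, m) == 2 ** m), default=0)

ground = list(range(6))
ranges = interval_ranges(ground)
```

For intervals, an m-point subset induces m(m+1)/2 + 1 sets, so the shatter function grows quadratically; in this particular example the scaffold dimension and the VC-dimension coincide.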

We are now ready to define ε-nets and ε-approximations. A subset $S \subseteq X$ is an ε-net for $(X, \mathcal{R})$
provided that $S \cap R \neq \emptyset$ for every $R \in \mathcal{R}$ with $|R|/|X| > \varepsilon$. A subset $A \subseteq X$ is an ε-approximation for
$(X, \mathcal{R})$ provided that

$$\left| \frac{|A \cap R|}{|A|} - \frac{|X \cap R|}{|X|} \right| \le \varepsilon$$

for every set $R \in \mathcal{R}$. Note that every ε-approximation is automatically an ε-net, but the converse need
not be true. A remarkable property of set systems of scaffold dimension $d$ is that, for any $\varepsilon \in (0, 1)$,
they admit an ε-approximation whose size depends only on $d$ and $\varepsilon$, not on the size of $X$. The first
basic result in this vein is the following lemma.

Lemma 2.1 For any set system $(X, \mathcal{R})$ with a finite $X$ and scaffold dimension at most $d$, where
$d > 1$, there exists, for any $\varepsilon \in (0, 1)$, an ε-net of size at most $C_1 \varepsilon^{-1} \lg(\varepsilon^{-1})$ and an ε-approximation
of size at most $C_2 \varepsilon^{-2} \lg(\varepsilon^{-1})$. Here $C_1, C_2$ depend only on $d$.

Note that, in general, the $\lg(\varepsilon^{-1})$ factor cannot be removed from the bound.
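The two definitions can be checked directly on a toy instance. The sketch below is a brute-force verifier for interval ranges on a line; the helper names and the example sample (every tenth point of 0..99) are illustrative assumptions, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction

def intervals(X):
    """Yield every nonempty subset of X cut out by a closed interval with endpoints in X."""
    xs = sorted(X)
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            yield set(xs[i:j + 1])

def is_eps_net(X, S, eps):
    """True if S intersects every range R with |R| / |X| > eps."""
    S = set(S)
    return all(R & S for R in intervals(X) if Fraction(len(R), len(X)) > eps)

def is_eps_approximation(X, A, eps):
    """True if | |A n R|/|A| - |X n R|/|X| | <= eps for every range R."""
    A = set(A)
    return all(
        abs(Fraction(len(R & A), len(A)) - Fraction(len(R), len(X))) <= eps
        for R in intervals(X)
    )

X = list(range(100))
A = X[5::10]          # 5, 15, ..., 95
```

Here `A` passes both checks for ε = 1/10 and fails both for ε = 1/20, which matches the intuition that a sample of 10 evenly spaced points cannot resolve ranges much smaller than a tenth of the data.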

Matoušek [25] gave a deterministic algorithm for efficiently computing small ε-approximations
(and thereby, ε-nets) for set systems with constant-bounded scaffold dimensions. Such an algorithm
requires the set system to be given in a form more "compact" than simply a listing of the elements
in each set. For this we assume the existence of a subsystem oracle, i.e., an algorithm (depending on
the specific geometric application) that, given any subset $Y \subseteq X$, lists all sets of $\mathcal{R}|_Y$. We say that the
subsystem oracle is of dimension at most $d$ if it lists all sets in time $O(|Y|^{d+1})$. This corresponds to
the scaffold dimension: the maximum number of sets in $\mathcal{R}|_Y$ is $\pi_{\mathcal{R}}(|Y|)$, and the $+1$ in the exponent
accounts for the fact that each output set is given by a list of size up to $|Y|$. Matoušek's result is
summarized by the following lemma.

¹Note that although many sets of $\mathcal{R}$ may intersect $Y$ in the same subset, this intersection appears only once in $\mathcal{R}|_Y$.

Lemma 2.2 Let $(X, \mathcal{R})$ be a set system with a subsystem oracle of dimension $d$, where $d$ is a constant.
Given any $\varepsilon \in (0, 1)$, we can compute an ε-approximation of size $O(\varepsilon^{-2} \lg(\varepsilon^{-1}))$ and an ε-net of size
$O(\varepsilon^{-1} \lg(\varepsilon^{-1}))$ in time $O(|X| \varepsilon^{-2d} \lg^{d}(\varepsilon^{-1}))$.

We shall use the algorithm above as a subroutine for our streaming algorithm for ε-approximations
(see Section 4). It is based on two observations that we state below. They correspond to two basic
operations of our algorithm, the merge step and the reduce step. Many algorithms for computing
ε-approximations (certainly the one Matoušek gave, and the one we shall give) start by partitioning
$X$ into small pieces, and then essentially alternate between the two steps until they get the desired
approximation.

Observation 2.3 (Merge Step) Let $X_1, \ldots, X_m \subseteq X$ be disjoint subsets of equal cardinality and
let $A_i$ be an ε-approximation of cardinality $b$ for $(X_i, \mathcal{R}|_{X_i})$, $i = 1, \ldots, m$. Then $A_1 \cup \cdots \cup A_m$ is an
ε-approximation for the subsystem induced by $\mathcal{R}$ on $X_1 \cup \cdots \cup X_m$.

Observation 2.4 (Reduce Step) Let $A$ be an ε-approximation for $(X, \mathcal{R})$ and let $A'$ be a δ-approximation
for $(A, \mathcal{R}|_A)$. Then $A'$ is an $(\varepsilon + \delta)$-approximation for $(X, \mathcal{R})$.
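Both observations can be checked numerically on a small instance. The following sketch does so for interval ranges on a line; the data, the sampling pattern, and the function name `interval_error` are assumptions made for illustration only.

```python
from fractions import Fraction

def interval_error(X, A):
    """Exact max, over closed intervals R with endpoints in X, of
    | |A n R|/|A| - |X n R|/|X| |, i.e. the smallest eps for which A
    is an eps-approximation of X with respect to intervals."""
    xs = sorted(X)
    worst = Fraction(0)
    for i, lo in enumerate(xs):
        for hi in xs[i:]:
            in_x = sum(lo <= x <= hi for x in X)
            in_a = sum(lo <= a <= hi for a in A)
            worst = max(worst, abs(Fraction(in_a, len(A)) - Fraction(in_x, len(X))))
    return worst

# Two disjoint pieces of equal cardinality with equal-size samples.
X1, X2 = list(range(0, 50)), list(range(50, 100))
A1, A2 = X1[2::5], X2[2::5]
eps = max(interval_error(X1, A1), interval_error(X2, A2))

# Merge step: the union of the samples approximates the union of the pieces.
merged = A1 + A2
merge_err = interval_error(X1 + X2, merged)

# Reduce step: approximate the merged sample by a smaller one.
reduced = merged[1::2]
delta = interval_error(merged, reduced)
reduce_err = interval_error(X1 + X2, reduced)
```

On this instance `merge_err <= eps` and `reduce_err <= merge_err + delta`, as the two observations predict. The reduce step trades a smaller sample for an additive loss in accuracy, which is why the streaming algorithm has to budget the errors across levels.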

Before we end this preliminary section, we state the following extension to Lemma 2.2. This, too,
was given by Matoušek [25].

Lemma 2.5 Let $X$ be a finite set equipped with a probabilistic measure $\mu$ (given by a table) and let
$\Sigma = (X, \mathcal{R})$ be a range space satisfying the assumptions of Lemma 2.2. Then an ε-approximation for
$\Sigma$ with respect to the measure $\mu$ can be computed with the same asymptotic running
time and size of the ε-approximation as in the case of the uniform measure in Lemma 2.2.

When $X$ is associated with a probabilistic measure $\mu$, an ε-approximation of $(X, \mathcal{R})$ is a multi-set $A$
such that

$$\left| \frac{|A \cap R|}{|A|} - \frac{\mu(X \cap R)}{\mu(X)} \right| \le \varepsilon$$

for every $R \in \mathcal{R}$.



3 Some Additional Extensions for Weighted Sets

While the extensions described above are useful in our context, we nevertheless need some further
generalizations, which will be useful in the data streaming model. In particular, we need to generalize
Observation 2.3 to a weighted case. This allows us to merge ε-approximations of different sizes and for
sets of different cardinalities. To the best of our knowledge, this is the first time such an observation is
being made. Note that in the un-weighted case, for an ε-approximation $A$ for $(X, \mathcal{R})$, each element in
$A$ "represents" $|X|/|A|$ elements in $X$. This is easy to see if we write the requirement defining an
ε-approximation in the following form:

$$\left| \, |A \cap R| \cdot \frac{|X|}{|A|} - |X \cap R| \, \right| \le \varepsilon |X|.$$

Now, instead of having each element $p$ in the ε-approximation $A$ represent the same number of elements
in $X$, we can assign it a weight $\gamma(p)$ equal to the number of elements in $X$ that it represents. In this
generalized scenario, a subset $A \subseteq X$ is a weighted ε-approximation for $(X, \mathcal{R})$ provided that

$$\left| \sum_{p \in A \cap R} \gamma(p) - |X \cap R| \right| \le \varepsilon |X|$$

for every $R \in \mathcal{R}$.

We are now ready to state our observation related to weighted merging.

Observation 3.1 (Weighted Merge Step) Let $X_1, \ldots, X_m \subseteq X$ be disjoint subsets (of cardinalities
not necessarily the same) and let $A_i$ be a weighted ε-approximation of $(X_i, \mathcal{R}|_{X_i})$, $i = 1, \ldots, m$. Then
$A_1 \cup \cdots \cup A_m$ is a weighted ε-approximation for the subsystem induced by $\mathcal{R}$ on $X_1 \cup \cdots \cup X_m$, where
the weights on the points remain as they were.
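A small numerical check of the weighted merge step, again for interval ranges on a line. The two pieces deliberately have different cardinalities and different per-element weights; the names `with_weights` and `weighted_interval_error` are hypothetical helpers introduced for this sketch.

```python
def with_weights(X, A):
    """Attach to each sample point the number of elements of X it represents."""
    return [(p, len(X) / len(A)) for p in A]

def weighted_interval_error(X, weighted):
    """Max over closed intervals R (endpoints in X) of
    |sum of weights of sample points in R - |X n R||, divided by |X|."""
    xs = sorted(X)
    worst = 0.0
    for i, lo in enumerate(xs):
        for hi in xs[i:]:
            true = sum(lo <= x <= hi for x in X)
            est = sum(w for p, w in weighted if lo <= p <= hi)
            worst = max(worst, abs(est - true) / len(X))
    return worst

X1 = list(range(0, 64))            # 64 elements, sampled every 8th: weight 8
X2 = list(range(64, 80))           # 16 elements, sampled every 2nd: weight 2
W1 = with_weights(X1, X1[4::8])
W2 = with_weights(X2, X2[1::2])

e1 = weighted_interval_error(X1, W1)
e2 = weighted_interval_error(X2, W2)
merged_err = weighted_interval_error(X1 + X2, W1 + W2)
```

The merged sample keeps the original weights, and its error relative to the union does not exceed the larger of the two individual errors, as Observation 3.1 states.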

4 Computing ε-Approximations in Geometric Streams

Let $x_1, \ldots, x_n, \ldots$ be a stream of geometric objects in the time series model. Let $X$ be the set of all the
objects in the stream that have arrived so far. Let $\mathcal{R}$ be a set of ranges defined on $X$, and let $\Sigma = (X, \mathcal{R})$
be the current range space. In addition, let $d$, where $d$ is a constant, be the scaffold dimension of $\Sigma$.

We present an algorithm which computes an ε-approximation to $(X, \mathcal{R})$, given any $\varepsilon \in (0, 1)$. Our
algorithm maintains a polylogarithmic-sized data structure from which it computes this ε-approximation.
Additionally, it takes polylogarithmic time to update this structure on the arrival of a new item in the
stream.

The ε-approximation our algorithm produces is asymptotically equal in size to that produced in
the static model (see Lemma 2.2). Interestingly, our algorithm does not need to know the value of
$n$ in advance. Our algorithm simulates the divide-and-conquer approach of the static algorithm in a
bottom-up fashion. We now outline how this is done.

We begin by imposing a hierarchy of groupings onto the stream: define canonical sets $S_{j,k}$ as
$\{x_i \mid j2^k \le i < (j+1)2^k\}$ for $j, k \ge 0$. Canonical sets are inter-related through a natural tree hierarchy.
The children of set $S_{j,k}$, $k \ge 1$, are the canonical sets $S_{2j,k-1}$ and $S_{2j+1,k-1}$. We say that a canonical
set $S_{j,k}$ becomes available when the last element in it, i.e., $x_{(j+1)2^k-1}$, arrives. A maximal canonical set
is one that is available but whose parent is not yet available. Observe that when $x_n$ arrives, there are
at most $\lg n$ maximal canonical sets. Also, the union of all the maximal canonical sets is the set $X$ of
all elements that have arrived so far.
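The maximal canonical sets correspond to the binary representation of the number of items seen. The sketch below assumes the stream is indexed from 0, so that after m items the indices 0..m-1 have arrived; both function names are hypothetical. The second function applies the definition literally and serves only as a cross-check.

```python
def maximal_canonical_sets(m):
    """Pairs (j, k) of the maximal canonical sets S_{j,k} after m items
    (indices 0..m-1) have arrived, read off the binary digits of m."""
    out = []
    start = 0                       # index of the first item not yet covered
    for k in reversed(range(m.bit_length())):
        if m & (1 << k):
            out.append((start >> k, k))
            start += 1 << k
    return out

def maximal_by_definition(m):
    """Same result from the definition: available sets whose parent is not available."""
    def available(j, k):
        # S_{j,k} is available once its last index (j+1)*2^k - 1 is below m.
        return (j + 1) << k <= m
    found = []
    for k in range(m.bit_length()):
        for j in range(m >> k):     # exactly the available sets at level k
            if not available(j // 2, k + 1):
                found.append((j, k))
    return sorted(found, key=lambda jk: -jk[1])
```

For example, after 6 items the maximal sets are $S_{0,2}$ (indices 0..3) and $S_{2,1}$ (indices 4..5). There is at most one maximal set per level, which is where the logarithmic bound on their number comes from.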

We use the following building blocks.

• ε-approx(): An algorithm for deterministically computing an ε-approximation of small size (see
Lemma 2.2).

• weighted_ε-approx(): An algorithm for deterministically computing ε-approximations of weighted
items (see Lemma 2.5).

Note that we cannot afford to use ε-approx() on an input that is larger than polylogarithmic, as otherwise
we will not remain within our space and time bounds.

Our algorithm, which we call ε-stream_approx(), follows the basic merge-and-reduce technique for
constructing ε-approximations. To follow this technique we need a sequence $w_1, \ldots, w_u, \ldots$ with
the property that $W = \sum_{u=1}^{\infty} w_u = O(1)$. Here we shall use $w_i = i^{-1-c}$, for some $c > 0$.

At a high level the algorithm is as follows (see Figure 1): At every stage, the algorithm stores a
δ-approximation for each available maximal canonical set, where δ varies with the set, but is always at
most $\varepsilon/2$. Let $A_{j,k}$ be such an approximation for $S_{j,k}$. This δ-approximation is constructed by
merging the approximations $A_{2j,k-1}$ and $A_{2j+1,k-1}$, which were computed earlier for the two children
of $S_{j,k}$, and then reducing the result.

The ε-approximation of the set $X$ at any point, the stream output, is determined by weighted
merging. Each element $p \in A_{j,k}$ is assigned a weight $\gamma(p) = |S_{j,k}|/|A_{j,k}|$ for this purpose. As it
happens, once a weight is assigned to an object, we never need to change it.

We are now ready to formally specify ε-stream_approx(). Figure 2 contains the specification. Assume
that $A_{j,0}$ is the element itself in the singleton set $S_{j,0}$.

[Figure 1 legend: past stream item; current item; available set; maximal set; weighted merge; merge and reduce. Horizontal axis: time.]

Figure 1: Schematic: Computing an ε-approximation of a data stream



ε-stream_approx()

When the next element $x_n$ in the stream arrives
    For each canonical set $S_{j,k}$ that becomes available, taken in order of increasing $k$, where $k \ge 1$
        /* Combine approximations of its children using (un-weighted) merge to get an approximation
           for the parent */
        $B \leftarrow A_{2j,k-1} \cup A_{2j+1,k-1}$.
        /* Reduce the size of the approximation */
        $A_{j,k} \leftarrow$ $(\frac{\varepsilon}{2} \cdot w_k/W)$-approximation of $B$ using ε-approx().
        /* Assign weights to elements */
        For all $p \in A_{j,k}$: $\gamma(p) \leftarrow |S_{j,k}|/|A_{j,k}|$.
    /* Combine approximations of maximal canonical sets using weighted merge to get an approximation
       for the stream */
    $A' \leftarrow \bigcup_{S_{j,k} \text{ available and maximal}} A_{j,k}$.  /* Each element in $A'$ retains its weight from its original $A_{j,k}$ */
    /* Reduce the size of the approximation */
    $A \leftarrow$ $(\varepsilon/2)$-approximation of $A'$ using weighted_ε-approx().
    Output $A$.

Figure 2: Algorithm for computing an ε-approximation of a geometric stream.
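The merge-and-reduce skeleton can be sketched in a few lines for one-dimensional points and closed-interval ranges. This is not the algorithm of Figure 2: the class name `StreamApprox1D`, the `cap` parameter, and the halving rule (keep every other element of the sorted merge) are assumptions standing in for ε-approx() and its level-dependent error budget, and the final weighted reduction is omitted.

```python
class StreamApprox1D:
    """Merge-and-reduce over canonical sets, for numbers and interval ranges only."""

    def __init__(self, cap=64):
        self.cap = cap      # hypothetical knob: max sample size kept per canonical set
        self.levels = []    # stack of (k, sample), one entry per maximal canonical set
        self.n = 0

    def add(self, x):
        self.n += 1
        k, sample = 0, [x]                      # the singleton set at level 0
        # A stored set at the same level is the left sibling: its parent is now available.
        while self.levels and self.levels[-1][0] == k:
            _, left = self.levels.pop()
            merged = sorted(left + sample)      # un-weighted merge of equal-size samples
            if len(merged) > self.cap:
                merged = merged[1::2]           # stand-in reduce: keep every other element
            k, sample = k + 1, merged
        self.levels.append((k, sample))

    def weighted_sample(self):
        """Weighted merge: each point of a level-k sample represents 2^k / |sample| items."""
        out = []
        for k, sample in self.levels:
            w = (1 << k) / len(sample)
            out.extend((p, w) for p in sample)
        return out

    def estimate_count(self, lo, hi):
        """Estimated number of stream items in the closed interval [lo, hi]."""
        return sum(w for p, w in self.weighted_sample() if lo <= p <= hi)
```

For interval ranges, halving a sorted sample of size 2b changes any range fraction by at most 1/(2b), so with `cap=64` each reduce adds at most 1/128 to the error. After 1000 items the largest canonical set has been reduced three times, which bounds the count error by (3/128)·1000. The paper's algorithm instead shrinks the per-level error as $w_k$ so that the total stays below $\varepsilon/2$ for any stream length.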

Correctness, Space, and Time. Observations 2.3 and 2.4 imply that $A_{j,k}$ is a δ-approximation for
$S_{j,k}$, where

$$\delta = \sum_{u=1}^{k} \frac{\varepsilon}{2} \cdot \frac{w_u}{W} \le \frac{\varepsilon}{2}.$$

Together with Observation 3.1 this implies that $A'$ is a weighted $(\varepsilon/2)$-approximation for the set $X$ of
elements in the stream. Now bring in the properties of weighted_ε-approx() and another application of
Observation 2.4 to see that $A$ is indeed an ε-approximation of $(X, \mathcal{R})$.

The data structure needs to store just the $A_{j,k}$'s; all other sets are intermediate results that can
be discarded. Lemma 2.2 implies that the size of $A_{j,k}$ is $O((\varepsilon w_k)^{-2} \lg((\varepsilon w_k)^{-1}))$; remember that
the size is determined by just the last reduction step. Denote the size of the largest such set, i.e., $A_{j,\lg n}$,
by $s = s(n, \varepsilon^{-1})$, which is $O(\lg^{2+2c} n \cdot \varepsilon^{-2} \cdot (\lg\lg n - \lg\varepsilon))$. Note that $s$ is also an upper bound for the
size of the input to ε-approx().

Consider the space and time requirements. These are dominated by the requirements for
weighted_ε-approx(). The size of the input to weighted_ε-approx() is $O(\lg n \cdot s)$. Lemma 2.5 implies that the space and time
requirements for ε-stream_approx() are $O(\lg^{3+2c} n \cdot \varepsilon^{-2d-2} \lg^{d}(\varepsilon^{-1}) \cdot (\lg\lg n - \lg\varepsilon))$.
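The bounds in this analysis follow by substitution. The worked derivation below is a sketch that treats $c$ and $d$ as constants, assumes $w_k = k^{-1-c}$, and writes $m$ for the input size of weighted_ε-approx() and $T$ for the resulting cost; both symbols are introduced here for convenience.

```latex
\begin{align*}
s &= O\big((\varepsilon w_{\lg n})^{-2}\,\lg((\varepsilon w_{\lg n})^{-1})\big),
   \qquad w_{\lg n} = (\lg n)^{-1-c} \\
  &= O\big(\varepsilon^{-2}\,\lg^{2+2c} n \cdot ((1+c)\lg\lg n - \lg\varepsilon)\big)
   = O\big(\lg^{2+2c} n \cdot \varepsilon^{-2} \cdot (\lg\lg n - \lg\varepsilon)\big), \\
m &= O(\lg n \cdot s)
   \qquad \text{(at most $\lg n$ maximal canonical sets, each of size at most $s$)}, \\
T &= O\big(m\,\varepsilon^{-2d}\,\lg^{d}(\varepsilon^{-1})\big)
   = O\big(\lg^{3+2c} n \cdot \varepsilon^{-2d-2}\,\lg^{d}(\varepsilon^{-1}) \cdot (\lg\lg n - \lg\varepsilon)\big).
\end{align*}
```

The last line applies the running time of the static algorithm to an input of size $m$ with approximation parameter $\varepsilon/2$, which changes only the constants.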

5 Applications

ε-nets and ε-approximations have a number of applications in computational geometry, and even in other
areas such as learning theory (see, e.g., [24]). Many of the problems in these areas have streaming versions.
One basic application is range counting. In this, we are given a set of $n$ points in $\mathbb{R}^d$, and a family
$\mathcal{R}$ (the ranges) of subsets of $\mathbb{R}^d$. Each query consists of a range $R \in \mathcal{R}$ and asks for the number of
points in it. Typical range families are axis-orthogonal ranges, spherical ranges (proximity queries),
and simplicial ranges. The corresponding range spaces all have bounded scaffold dimension.
In the streaming version, the point set $S$ comes as a continuous stream, interspersed with queries. It is
easy to see how our algorithm would work here: use ε-stream_approx() to maintain an ε-approximation
$A$ of the current $(S, \mathcal{R})$. When queried with a range $R \in \mathcal{R}$, output $|A \cap R| \cdot n/|A|$. This is within an
additive $\varepsilon n$ of the true value; this is akin to the iceberg queries mentioned earlier.
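The query rule $|A \cap R| \cdot n/|A|$ is a one-liner once a sample is available. In the sketch below the point set is a 20 × 20 grid and the sample is a hand-picked sub-grid; this sample is an assumption for illustration and is only a coarse approximation, not the output of ε-stream_approx(). The function names are hypothetical.

```python
def estimate_range_count(A, n, contains):
    """Estimate |X n R| as |A n R| * n / |A|, where contains(p) tests whether p lies in R."""
    return sum(1 for p in A if contains(p)) * n / len(A)

def rectangle(x0, x1, y0, y1):
    """Membership test for the axis-orthogonal range [x0, x1] x [y0, y1]."""
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

X = [(i, j) for i in range(20) for j in range(20)]               # 400 grid points
A = [(i, j) for i in range(1, 20, 4) for j in range(1, 20, 4)]   # 25-point sub-grid

query = rectangle(0, 9, 0, 9)
true_count = sum(1 for p in X if query(p))
estimate = estimate_range_count(A, len(X), query)
```

Only the sample `A` and the count `n` are needed at query time, so the full point set never has to be stored. A sample with a smaller ε would narrow the gap between `estimate` and `true_count`.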

The above technique has implications for many specific applications. To give a flavor of this, we
delve deeper into the specific area of robust statistics in the next few paragraphs.

5.1 Robust Statistics 

Robust statistics concerns the study of statistical estimators that can tolerate high numbers of outliers, 
while maintaining an accuracy of estimation that depends only on the remaining uncorrupted data 
points. In contrast, ordinary least squares estimators, while trivial to compute even in the streaming 
model, can be forced to produce estimates that are arbitrarily far from the correct model even in the 
presence of a single outlier. The proportion of outliers that an estimator can tolerate while preserving its
accuracy is called its breakdown point; in general, methods with high breakdown points are preferred, but
other criteria are also important, including statistical efficiency (number of samples needed to achieve a
given accuracy) and computational efficiency (amount of time it takes to compute a given estimate from
a set of samples). Many robust statistical methods also have the advantage of being non-parametric,
not requiring the statistician to produce a prior probability distribution or other arbitrary parameters
before producing a fit. The paradigmatic example of a robust statistic is the median of one-dimensional
data, which, unlike the mean, is robust with a breakdown point of 1/2. Much research on streaming
algorithms has gone into methods for maintaining approximate medians or more general quantiles [12],
and we would like to find similar methods for higher dimensional statistics.

Two of the critical problems studied in robust statistics are location (finding a central point in a 
cloud of data points) and regression (fitting the data to a model in which a dependent variable or 
variables is a linear function of the independent variables). Many methods in this area are based on 
various concepts of depth, which measures the quality of fit of an estimate. It is natural to seek the 
estimate maximizing the depth, but it is also of importance to be able to compute depths of non-optimal 
estimates, in order to form depth contours that produce a center-outward ordering of the data. 

For many of these robust statistical methods, a computationally efficient streaming approximation
to the depth measure can be obtained from an ε-approximation of the sample data. The deepest fit can
be approximated by a deepest fit to the ε-approximation, and this approximate fit often has similar
breakdown point properties to the non-approximate fit on which it is based. We describe below several
of the methods to which this technique applies:

Tukey Depth. This quantity measures the quality of fit of a center, as the minimum proportion
of sample points among all halfspaces that contain the center. The Tukey depth of a point can be
computed in time $O(n^{d-1} \log n)$, where $n$ denotes the number of sample points [30]. The Tukey median
is the point of maximum depth. It is known that any Tukey median has depth at least $1/(d+1)$,
and the breakdown point of the Tukey median as an estimate of location is also $1/(d+1)$. There
are known static algorithms for finding Tukey medians, or other points of high depth, in two or
three dimensions [18, 20, 23], but in higher dimensions only inefficient linear-programming based exact
solutions are known and it is necessary to resort to more efficient approximation algorithms.

Since the Tukey depth is based on counting points in halfspaces, it can be approximated using
ε-approximations for halfspace ranges [5]: the depth of a point within an ε-approximation of a sample is
within an additive error of $\varepsilon$ of its depth in the original sample data. In particular, the Tukey median
of an ε-approximation has depth within $\varepsilon$ of that of the true Tukey median. The breakdown point
of this approximate Tukey median is $1/(d+1) - \varepsilon$. Thus, by using our streaming ε-approximation
algorithm, we can efficiently maintain not only an approximate Tukey median of the data set, but
also a space-efficient data structure from which we can compute accurate approximations of the Tukey
depth of any point.
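To make the definition concrete, here is a brute-force sketch of Tukey depth for points in the plane. It is quadratic in the number of points and uses floating-point angles, so it is meant only as an illustration; the function name and the tolerance `1e-9` are assumptions, and this is not the algorithm of [30].

```python
import math

def tukey_depth_2d(q, pts):
    """Minimum fraction of pts in a closed halfplane whose boundary passes through q."""
    if not pts:
        return 0.0
    qx, qy = q
    # The count only changes when the halfplane normal is perpendicular to some p - q.
    crit = set()
    for px, py in pts:
        if (px, py) != (qx, qy):
            t = math.atan2(py - qy, px - qx)
            crit.add((t + math.pi / 2) % (2 * math.pi))
            crit.add((t - math.pi / 2) % (2 * math.pi))
    if not crit:
        return 1.0                      # every sample point coincides with q
    angles = sorted(crit)
    best = len(pts)
    # Evaluate one normal direction strictly inside each arc between critical angles.
    for a, b in zip(angles, angles[1:] + [angles[0] + 2 * math.pi]):
        if b - a < 1e-9:
            continue                    # skip numerically duplicated critical angles
        mid = (a + b) / 2
        ux, uy = math.cos(mid), math.sin(mid)
        count = sum((px - qx) * ux + (py - qy) * uy >= 0 for px, py in pts)
        best = min(best, count)
    return best / len(pts)
```

Running the same function on an ε-approximation for halfplane ranges, instead of on the full sample, would change the returned depth by at most ε, which is the property the streaming application relies on.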

Simplicial Depth. This is another measure of quality of fit for location, introduced by Liu. The
simplicial depth of a fit point is defined to be the proportion of simplices, among all the $\binom{n}{d+1}$ simplices
formed by convex hulls of $(d+1)$-tuples of sample points, that contain the fit point. Equivalently, it
is the probability that a randomly chosen $(d+1)$-tuple contains the fit point in its convex hull. As we
now argue, for points in the plane, the simplicial depth in a sample set is accurately approximated by
the simplicial depth of an ε-approximation for wedge ranges (that is, ranges formed by intersecting two
halfplanes). Therefore, as for Tukey depth, we can answer approximate depth queries and maintain an
approximate deepest point in a space-efficient manner for streaming data.
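For reference, simplicial depth in the plane can be computed directly from the definition by enumerating all triangles. The sketch below does this in cubic time with exact integer orientation tests; it assumes integer coordinates and counts points on a triangle's boundary as contained. The function names are illustrative.

```python
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_triangle(q, a, b, c):
    """True if q lies in the closed triangle abc (either orientation)."""
    d1, d2, d3 = orient(a, b, q), orient(b, c, q), orient(c, a, q)
    has_neg = min(d1, d2, d3) < 0
    has_pos = max(d1, d2, d3) > 0
    return not (has_neg and has_pos)

def simplicial_depth_2d(q, pts):
    """Fraction of the triangles spanned by triples of pts that contain q."""
    triangles = list(combinations(pts, 3))
    return sum(in_triangle(q, *t) for t in triangles) / len(triangles)
```

The cubic enumeration is exactly what a streaming algorithm cannot afford on the full data, which motivates evaluating the depth on a small ε-approximation instead.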

Let $\delta$ be a value to be determined later and imagine the following process for approximately
measuring the simplicial depth of a fit point: first, let $L$ be a set of $1/\delta$ lines through the fit point,
partitioning the plane into $2/\delta$ wedges having the fit point as a common apex, with at most a $\delta$
fraction of the sample points in any wedge. Let $e_1$ be the proportion of triangles, determined by three
input points, that are not all on one side of a line in $L$. Then $e_1$ is an overestimate of the
simplicial depth, but the amount by which it overestimates the depth is $O(\delta)$: the only triangles
incorrectly included in the estimate are ones that have two points in opposite wedges, there are $O(\delta^2 n^3)$
such triangles per pair of opposite wedges, and $O(1/\delta)$ such pairs. Next, let $e_2$ be the proportion of
triangles, determined by three points in an ε-approximation of the sample, that are not all on one
side of a line in $L$. For the same reasons as before, $e_2$ is within $O(\delta)$ of the simplicial depth for the
ε-approximation. Further, $e_1$ and $e_2$ are within $O(\varepsilon/\delta)$ of each other: $e_1$ can be written as a sum
with one term per wedge, in which the $i$th term depends only on $w_i$, the number of sample points in the
$i$th wedge, and $h_i$, the number of sample points
in the halfplane containing the $i$th wedge on its counterclockwise boundary. Each term in the sum is
approximated within $O(\varepsilon)$ by the corresponding term where $w_i$ and $h_i$ are replaced by numbers of points
in the ε-approximation, and there are $O(1/\delta)$ terms, so the total difference between $e_1$ and $e_2$ is $O(\varepsilon/\delta)$.
Putting together the errors in going from the original simplicial depth to $e_1$ to $e_2$ to the simplicial depth
of the approximation, and setting $\delta = \sqrt{\varepsilon}$, we see that the ε-approximation approximates the simplicial
depth to within $O(\sqrt{\varepsilon})$.
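The choice of $\delta$ balances the two error sources in the argument above. As a short worked step, with $E(\delta)$ introduced here as a name for the total error and constants suppressed:

```latex
\begin{align*}
E(\delta) &= \underbrace{O(\delta)}_{\text{depth} \to e_1}
           + \underbrace{O(\varepsilon/\delta)}_{e_1 \to e_2}
           + \underbrace{O(\delta)}_{e_2 \to \text{depth of approximation}}
           = O\!\left(\delta + \frac{\varepsilon}{\delta}\right), \\
\delta = \frac{\varepsilon}{\delta}
  &\iff \delta = \sqrt{\varepsilon}
  \quad\Longrightarrow\quad E(\sqrt{\varepsilon}) = O(\sqrt{\varepsilon}).
\end{align*}
```

A smaller $\delta$ means thinner wedges and fewer miscounted triangles, but more terms in the sum, each contributing its own $O(\varepsilon)$ error. Equating the two terms gives the stated bound.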

As far as we are aware, this deterministic ε-approximation based method for approximating
simplicial depth is novel even for static, non-streaming data, although it is trivial to approximate simplicial
depth randomly in the static case by sampling triangles. It seems likely that similar deterministic and
streaming approximation guarantees, with worse dependence on $\varepsilon$, can be shown to hold also in higher
dimensions.

Regression Depth. This statistic was introduced by Rousseeuw and Hubert [28] as a measure of
the quality of fit of a regression hyperplane. It is defined as being the minimum proportion of sample
points that can be removed to turn the fit plane into a nonfit, that is, a hyperplane combinatorially
equivalent to a vertical hyperplane. Amenta et al. [1] showed that, like Tukey depth, for regression
depth a fit always exists with depth at least $1/(d+1)$, and the breakdown point of the maximum-depth
fit is $1/(d+1)$. Their proof technique shows that the regression depth of a query hyperplane
can be measured by performing a certain projective transformation of the space containing the sample
points, and measuring the Tukey depth of a certain point in the transformed space. Due to the
transformation, a halfspace in the transformed space may correspond to a double wedge (symmetric
difference of two halfspaces) in the original space. Therefore, the same ε-approximation technique
used for Tukey depth, but with double wedge ranges, also applies to regression depth, and lets us
compute depths and maintain an approximate deepest fit with high breakdown point for streaming
data. Bern and Eppstein [2] generalized regression depth to the context of multivariate regression,
in which the sample data have more than one dependent variable; in their definition, the depth of a
fit is the minimum proportion of sample data contained in any double wedge, one boundary of which
contains the fit and the other of which is parallel to the dependent coordinate axes; this is again well
approximated by ε-approximations for double wedge ranges.

The Theil-Sen Estimator. This estimator [32, 33] is a method for two-dimensional linear regression,
in which one first finds the median among all $\binom{n}{2}$ slopes determined by the lines through pairs of sample
points, and then selects a regression line having that median slope and bisecting the sample set. It
has a breakdown point of $1 - \sqrt{2}/2 \approx 0.293$. This has long been a testbed for geometric optimization
algorithms, and several $O(n \log n)$ time static algorithms for it are known, among them one based
on using ε-cuttings in a prune-and-search technique. However, these algorithms seem to require
repeatedly scanning the data in a way that is unavailable to a streaming algorithm. Instead, we apply
an approximation technique very similar to that for simplicial depth, above.
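The estimator itself is short to state in code. The following is a quadratic-time sketch straight from the definition, not one of the $O(n \log n)$ algorithms mentioned above. Choosing the intercept as the median of the residual offsets is one common way to make the line bisect the sample, and is an assumption of this sketch.

```python
from itertools import combinations
from statistics import median

def theil_sen(points):
    """Return (slope, intercept): the median pairwise slope, and an intercept
    chosen so that the line with that slope bisects the sample."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x1 != x2]                      # vertical pairs have no slope
    m = median(slopes)
    b = median(y - m * x for x, y in points)
    return m, b
```

On six points of the line y = 2x + 1 plus one gross outlier, the fit is still exactly that line, whereas an ordinary least squares fit would be pulled toward the outlier. In the streaming setting the same computation would be run on an ε-approximation rather than on the full sample.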

To begin with, suppose that we are given a query slope $s$, and must determine the approximate
position of $s$ within the sorted sequence of slopes, normalized by dividing the position by $\binom{n}{2}$. This
can be solved exactly by a reduction to computing the number of inversions in a permutation, but we
are interested in approximations that can be computed by a streaming algorithm that does not know
$s$ in advance. To do this, let $\delta$ be a parameter to be determined later, and imagine subdividing the
sample points into a grid by $O(1/\delta)$ lines, some vertical and some parallel to $s$, in such a way that at most
a $\delta$ proportion of the points lie in the slab between any two adjacent parallel grid lines. Let $e_1$ be an
estimate of the position of $s$, formed by summing up the normalized number of pairs of points that form
a line with lower slope than $s$ and that are in a pair of grid cells that are separated both by a vertical
line of the grid and by a line parallel to $s$ from the grid. Then $e_1$ is within $O(\delta)$ of the true position
of $s$, since the only lines through a given point that are omitted from the count are the ones where the
other point determining the line is in one of the two slabs containing that point. Moreover, $e_1$ can be
expressed as a sum with $O(\delta^{-4})$ terms, each term being a product of the number of points in two parallelograms.
Let $e_2$ be a similar normalized sum, with the number of sample points in each parallelogram replaced
by the number of points of an $\epsilon$-approximation for parallelogram ranges, and let $e_3$ be the normalized
position of $s$ within the set of lines determined by pairs of points from the $\epsilon$-approximation. Then $e_1$
differs from $e_2$ by $O(\epsilon\delta^{-4})$ and $e_2$ differs from $e_3$ by $O(\delta + \epsilon\delta^{-2})$. Therefore, the overall error caused
by using $e_3$ as our approximation to the position of $s$ is $O(\delta + \epsilon\delta^{-4})$. Setting $\delta = \epsilon^{1/5}$ balances the two
terms and makes this total error equal $O(\epsilon^{1/5})$.
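The exact reduction to inversion counting mentioned at the start of this paragraph can be sketched in Python. This is an offline illustration only, not the streaming algorithm; the function names `count_inversions` and `slope_rank` are ours, and the sketch assumes that all points have distinct x-coordinates.

```python
def count_inversions(seq):
    """Return the number of pairs i < j with seq[i] > seq[j], using merge sort."""
    def sort(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort(a[:mid])
        right, inv_right = sort(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] precedes every remaining element of left:
                # each of those pairs is one inversion.
                merged.append(right[j])
                j += 1
                inv += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv
    return sort(list(seq))[1]


def slope_rank(points, s):
    """Fraction of the pairwise slopes that are strictly lower than s.

    Assumes distinct x-coordinates (an assumption of this sketch).
    For x_i < x_j the pair has slope < s exactly when
    y_j - s*x_j < y_i - s*x_i, i.e. when the values y - s*x,
    listed in order of x, contain an inversion at (i, j).
    """
    pts = sorted(points)
    offsets = [y - s * x for x, y in pts]
    n = len(pts)
    return count_inversions(offsets) / (n * (n - 1) / 2)
```

Calling `slope_rank(sample, s)` on a small sample gives the normalized position that the estimates $e_1$, $e_2$, $e_3$ approximate.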

To compute an approximate Theil-Sen estimator, we use the same $\epsilon$-approximation for parallelograms,
compute the median slope among pairs of points from the approximation, and then find a line
with that median slope bisecting the approximation. The resulting line has slope with a normalized
position within $O(\epsilon^{1/5})$ of the median slope, partitions the sample points within $\epsilon$ of exact bisection,
and has a breakdown point of $1 - \sqrt{2}/2 - O(\epsilon^{1/5})$.
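The two steps just described (median pairwise slope, then a bisecting line of that slope) can be sketched as below. This is a brute-force illustration under our own assumptions: `sample` stands for the points of the $\epsilon$-approximation, which is taken to be small enough for a quadratic-time scan, and the name `theil_sen` is ours.

```python
from itertools import combinations
from statistics import median


def theil_sen(sample):
    """Return (slope, intercept) of the Theil-Sen line of `sample`.

    `sample` is a list of (x, y) pairs; pairs with equal x are skipped.
    Runs in quadratic time, which is acceptable only for a small sample.
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(sample, 2)
              if x1 != x2]
    m = median(slopes)
    # A line of slope m bisects the sample when its intercept is the
    # median of the values y - m*x.
    b = median(y - m * x for x, y in sample)
    return m, b
```

For example, `theil_sen([(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)])` ignores the single outlier and recovers the line through the other four points.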

Least Median of Squares (LMS). These methods [21] in robust statistics seek a fit that minimizes 
the median residual value separating the fit from the sample points. This is not a depth-based criterion, 
but it leads to fits which are highly robust against outliers. For location problems, the least median of 
squares fit is the center of the minimum radius sphere that contains at least half of the sample data.
It has a breakdown point of $\frac{1}{2}$: if fewer than half the sample data points are outliers, then the sphere
defining the LMS fit has radius no larger than the circumsphere of the non-outliers, and it contains at
least one non-outlier, so its center must be an accurate fit. Clearly, this is the best breakdown point
possible for any location method. The natural type of $\epsilon$-approximation to use for this problem is one
with balls as its ranges. If we form the LMS fit of such an $\epsilon$-approximation, the result may not be
robust. Instead, we approximate the LMS fit by finding the center of the minimum radius sphere that
contains at least a $\frac{1}{2} + \epsilon$ proportion of the points in the $\epsilon$-approximation. Such a sphere must therefore
contain at least half of the sample data, and has a radius at least as small as the smallest sphere
containing at least a $\frac{1}{2} + 2\epsilon$ fraction of the sample data. It is robust with a breakdown point of $\frac{1}{2} - 2\epsilon$.
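In one dimension a ball is an interval, so the rule above reduces to finding the shortest interval that covers the required fraction of the approximation. The following is a minimal sketch under that one-dimensional assumption; `sample` stands for the points of the $\epsilon$-approximation and the name `lms_location_1d` is ours.

```python
import math


def lms_location_1d(sample, eps):
    """Return (center, radius) of the shortest interval that contains
    at least a (1/2 + eps) fraction of the numbers in `sample`."""
    xs = sorted(sample)
    m = len(xs)
    h = min(m, math.ceil((0.5 + eps) * m))  # points the interval must cover
    best_lo, best_hi = xs[0], xs[h - 1]
    # An optimal interval starts and ends at sample points, so it is
    # enough to slide a window of h consecutive sorted values.
    for i in range(m - h + 1):
        lo, hi = xs[i], xs[i + h - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return (best_lo + best_hi) / 2, (best_hi - best_lo) / 2
```

With `sample = [10, 11, 12, 13, 14, 15, 16, 500, 900]` and `eps = 0.1`, the window must cover six values, so the two large outliers cannot pull the center away from the cluster.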

The same LMS approach can also be applied to regression problems. The least median of squares 
regression hyperplane can be defined as the central hyperplane in a slab bounded by two parallel 
hyperplanes, with minimum vertical separation between them, that contains at least half of the sample 
data; again this is robust with a breakdown point of $\frac{1}{2}$. As above, we can use an $\epsilon$-approximation,
with slab ranges, and find the slab with minimum vertical separation containing a $\frac{1}{2} + \epsilon$ fraction of the
$\epsilon$-approximation points, to produce an approximate LMS fit with breakdown point $\frac{1}{2} - 2\epsilon$.
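For two-dimensional data the slab search can be sketched by brute force. The sketch below makes a simplifying assumption that is ours, not the paper's: it only tries slopes defined by pairs of sample points. It also takes `sample` to be a small $\epsilon$-approximation, since the running time is roughly cubic; the name `lms_regression_2d` is hypothetical.

```python
import math
from itertools import combinations


def lms_regression_2d(sample, eps):
    """Return (slope, intercept, height): the central line of the
    narrowest slab, measured vertically, that covers at least a
    (1/2 + eps) fraction of `sample`, and that slab's vertical height.

    Only slopes through pairs of sample points are tried, which is a
    simplifying assumption of this sketch.
    """
    m = len(sample)
    h = min(m, math.ceil((0.5 + eps) * m))  # points the slab must cover
    best = None  # (height, slope, intercept of the central line)
    for (x1, y1), (x2, y2) in combinations(sample, 2):
        if x1 == x2:
            continue  # vertical direction: not a regression slope
        slope = (y2 - y1) / (x2 - x1)
        offsets = sorted(y - slope * x for x, y in sample)
        # For a fixed slope, the best slab covers h consecutive offsets.
        for i in range(m - h + 1):
            height = offsets[i + h - 1] - offsets[i]
            if best is None or height < best[0]:
                mid = (offsets[i] + offsets[i + h - 1]) / 2
                best = (height, slope, mid)
    if best is None:
        raise ValueError("need two points with distinct x-coordinates")
    height, slope, intercept = best
    return slope, intercept, height
```

On six collinear points plus three scattered outliers with `eps = 0.1`, the slab of height zero through the collinear points is selected.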

6 A Lower Bound on Range Counting 

We provide a simple lower bound on the space required to count approximately the number of items in
a range that is not necessarily an iceberg. When we say that an algorithm $f$-approximates the range
counting problem we mean that if a given range contains $\ell$ points, the algorithm gives us an answer
which lies between $\ell/f$ and $\ell \cdot f$.

The bound is stated in terms of two-sided ranges: a point $(x, y)$ is said to belong to the two-sided
range located at $(p, q)$ if $x \ge p$ and $y \ge q$.

Theorem 6.1 Any $f$-approximate algorithm for the two-sided range counting problem must use space
$\Omega(n/f^2)$.

We begin by assuming there is an algorithm $A$ which gives an $f$-approximation to the two-sided
range counting problem for a stream of points in two dimensions. Further, we assume that this algorithm
uses space $o(n/f^2)$.

Now consider a set of $n$ points which are grouped into $n/f^2$ equally sized groups $G_i$, where
$1 \le i \le n/f^2$ (so that each group contains $f^2$ points), in the following way:

• Each point in $G_i$ has the same $x$-coordinate, which we call $x_i$. Additionally, we require $x_i > x_{i-1}$.

• All the points in $G_i$ have $y$-coordinates closely clustered at a given value, which we call $y_i$. Formally,
for every $p_j \in G_i$, we require that $0 < y(p_j) - y_i < 1/2$.
• Every point $p_j \in G_i$ has $y$-coordinate strictly smaller than the $y$-coordinates of all the points in $G_{i-1}$.

• Each group $i$ has an additional point $q_i = (x_i + \epsilon, y_i + \epsilon)$, for some $\epsilon < 1/2$, associated with it.

Note that this family of input sequences has the property that a two-sided query made at $(x_i, y_i)$
should return a count of $f^2 + 1$ and one made at $(x_i + \frac{1}{2}, y_i + \frac{1}{2})$ should return a count of 1. This radical
change in the counts will not occur between two such queries at any point which is not actually $(x_i, y_i)$
for some value of $i$. As an extension to this simple observation, we note that if all the $x_i$s and $y_i$s are
chosen out of the integers $1, 2, \ldots, n$, it is possible to extract the exact values of all the $x_i$ with exactly
$2n^2$ queries.

Let us see if the algorithm $A$ can be the query mechanism which we can deploy to this end. Since
$A$ is an $f$-approximation, it should return a value of at most $f$ at $(x_i + \frac{1}{2}, y_i + \frac{1}{2})$ and a value between
$f + 1/f$ and $f^3 + f$ at $(x_i, y_i)$. This means that $A$ can indeed act as the oracle which identifies the
locations of the groups in our set.
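As a quick check of these bounds, write $A(p, q)$ for the answer returned by the algorithm at query point $(p, q)$ (this notation is ours), and take the true counts to be $f^2 + 1$ and $1$ as stated above. The $f$-approximation guarantee then gives:

```latex
\[
\frac{f^2 + 1}{f} = f + \frac{1}{f}
\;\le\; A(x_i, y_i) \;\le\;
f \cdot (f^2 + 1) = f^3 + f,
\qquad
A\!\left(x_i + \tfrac{1}{2},\, y_i + \tfrac{1}{2}\right) \;\le\; f \cdot 1 = f \;<\; f + \frac{1}{f}.
\]
```

The two answer ranges are disjoint for every $f > 0$, which is why a single pair of queries suffices to tell whether a group is located at a given point.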

Hence, using $A$ as a subroutine we can recover the locations of all $n/f^2$ groups, which amounts to
$\Omega(n/f^2)$ bits of information about the input set. This
contradicts the assumption that $A$ uses space $o(n/f^2)$. □

Seen in the context of streaming algorithms, Theorem 6.1 implies that it is not possible to approximate
the range counting problem in polylogarithmic space. One of the implications of this, among others,
is that it is not possible to count inversions in lists [13] in the sliding window model.

Acknowledgments. We would like to thank David Mount for helpful discussions of robust statistics in 
the context of the topics of this paper, and S. Muthukrishnan for helpful discussions on geometric streaming 
algorithms in general. 